perf(vanilla-io_uring): converge onto the epoll twin — zero-alloc rendering, /pipeline skip-decode, crud slab (impl vanilla#84, #85)#967

Open
enghitalo wants to merge 2 commits into MDA2AV:main from enghitalo:perf/vanilla-io_uring-converge

@enghitalo

Contributor

What & why

vanilla-io_uring was an under-optimized copy of its vanilla-epoll twin: same handlers, but allocating throwaway strings per request, using json.encode/json.decode reflection on the DB paths, and fully parsing the request even for the fixed /pipeline blit. This PR ports every backend-agnostic optimization from the epoll entry so the two share one audited set of response builders and diff cleanly.

The io_uring backend supports only a stateless request_handler (no async_handler / TLS — see enghitalo/vanilla#83), so DB access stays on the blocking db.pg client. Everything that does not require the async runtime now matches epoll byte-for-byte.

Implements enghitalo/vanilla#84 (zero-alloc int parse/format) and enghitalo/vanilla#85 (crud: 1-query list, byte-rendered GET, fast crud-body parse).

Supersedes #965 (same work, rebased clean with the static fix folded in — no regression-then-fix history).

Changes (all entry-only, no lib change)

  • wi is now negative-aware — fixes a latent wrong body for a negative /baseline11 sum (a=-10&b=3 now returns -7).
  • emit / emit_int (stack scratch) / emit_xcache — zero-alloc response framing. /baseline11 and /upload no longer allocate an int -> string per request.
  • /pipeline skip-decode fast path — blit the constant before any parsing; decode_into (no Result boxing) on the main parse path.
  • render_item_pg — byte-level JSON straight from db.pg text rows, removing the per-request json.encode reflection on /async-db, /crud list and /crud GET.
  • crud cache is an id-indexed slab (replaces map[int]string) with in-place buffer reuse + cache-aside invalidation, shared across ring workers under RwMutex.
  • crud_list uses a single windowed query (count(*) OVER()) instead of a page SELECT + a separate count(*).
  • parse_crud_body_fast + borrowed JSON field parsers (json.decode fallback kept for escaped bodies).
  • parse_i64_slice / dechunk_into / parse_hex_slice — allocation-free query/body parsing.
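
The single windowed query from the crud_list bullet can be sketched as follows. This is a minimal illustration, not the entry's code: it uses Python's built-in sqlite3 as a stand-in for Postgres, and the table `items`, its columns, and the helper `list_page` are hypothetical names. `count(*) OVER ()` attaches the total row count to every returned row, so one round trip yields both the page and the total.

```python
import sqlite3


def list_page(conn, limit, offset):
    """Fetch one page plus the total row count in a single query.

    Hypothetical helper: the real crud_list schema is not shown in this PR.
    """
    rows = conn.execute(
        "SELECT id, name, count(*) OVER () AS total "
        "FROM items ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()
    # Every row carries the same total; an empty page carries none.
    total = rows[0][2] if rows else 0
    items = [(row[0], row[1]) for row in rows]
    return items, total


# In-memory database standing in for the seeded Postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 8)],
)

items, total = list_page(conn, limit=3, offset=3)
print(items, total)
```

One caveat of this shape: when the offset is past the last row, no rows come back, so the total cannot be read from the result and the sketch reports 0.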

Static: unlike the epoll twin, this entry does not set static_assets.sendfile_min_bytes. The io_uring backend has no sendfile path (no core.enable_sendfile / queue_file drain), so a low threshold makes static_assets read every large .br/.gz sibling from disk per request (blocking the ring). Keeping the default (256 KiB) preloads all arena siblings (< 256 KiB) and serves them as a zero-copy core.queue_buf borrowed send.

What is intentionally NOT changed

DB profiles (fortunes, async-db, api-*, crud) stay capped by the blocking db.pg on the single ring worker — the io_uring backend has no async runtime to await DB readiness on the ring (enghitalo/vanilla#83). A per-worker reused render scratch (dropping the small per-request DB-render buffers) is a follow-up now that io_uring supports make_state (enghitalo/vanilla#93 landed; entry follow-up enghitalo/vanilla#97).

Validation

  • Both images build (v -prod -d vanilla_tls).
  • Ran both containers against a pristine seeded Postgres (fresh DB per framework) and diffed every route: pipeline, baseline11 (positive and negative), upload, json, json-comp, async-db, fortunes, static (br negotiation), crud list, crud GET (MISS→HIT), crud create/update, 404, json-tls. All 17 are byte-for-byte identical to vanilla-epoll.
  • X-Cache verified: GET MISS → HIT, re-MISS after a PUT (slab invalidation); POST → 201; json-comp → Content-Encoding: gzip; json-tls → 200 over TLS 1.3.
  • Static (the path that regressed in #965 before the fix): wrk at 64 conns on /static/vendor.js (→ 67 KB .br) serves ~101k req/s / 6.38 GB/s, zero socket errors (preloaded, bandwidth-bound).
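
The MISS → HIT → re-MISS sequence checked above follows from the cache-aside slab. The sketch below illustrates that behaviour only and is not the entry's implementation: `CrudSlab`, its methods, and the dict standing in for the database are hypothetical, and a plain `threading.Lock` replaces the RwMutex because Python's standard library has no reader-writer lock. It also ignores the race between a concurrent load and an invalidation.

```python
import threading


class CrudSlab:
    """Id-indexed cache: slot i holds the rendered body for item id i."""

    def __init__(self, capacity):
        self._bufs = [bytearray() for _ in range(capacity)]
        self._valid = [False] * capacity
        self._lock = threading.Lock()  # stand-in for an RwMutex

    def get(self, item_id, load):
        """Return (body, x_cache); `load` is called only on a miss."""
        in_range = 0 <= item_id < len(self._bufs)
        if in_range:
            with self._lock:
                if self._valid[item_id]:
                    return bytes(self._bufs[item_id]), "HIT"
        body = load(item_id)
        if in_range:
            with self._lock:
                # Refill the slot's existing buffer rather than replacing it.
                self._bufs[item_id][:] = body
                self._valid[item_id] = True
        return body, "MISS"

    def invalidate(self, item_id):
        """Mark a slot stale; the next get() reloads it."""
        if 0 <= item_id < len(self._bufs):
            with self._lock:
                self._valid[item_id] = False


db = {7: b'{"id":7,"name":"old"}'}  # hypothetical backing store
slab = CrudSlab(capacity=16)

first = slab.get(7, db.__getitem__)
second = slab.get(7, db.__getitem__)
db[7] = b'{"id":7,"name":"new"}'  # a PUT updates the store first...
slab.invalidate(7)                 # ...then drops the cached copy
third = slab.get(7, db.__getitem__)
print(first[1], second[1], third[1])
```

Indexing by id keeps the lookup a bounds check plus an array read; ids outside the slab's range simply bypass the cache in this sketch.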

🤖 Generated with Claude Code

…ering, /pipeline skip-decode, crud slab)

Port every backend-agnostic optimization from the vanilla-epoll entry so the two share
one audited set of response builders and diff cleanly. The io_uring backend supports
only a stateless request_handler (no async_handler / TLS — enghitalo/vanilla#83), so DB
access stays on the blocking db.pg client; everything else now matches epoll byte-for-byte
(verified: all 17 routes identical against a pristine seeded Postgres).

Implements enghitalo/vanilla#84 (zero-alloc int parse/format) and enghitalo/vanilla#85 (crud: 1-query
list, byte-rendered GET, fast body parse):

- wi: negative-aware (fixes a latent wrong body for a negative /baseline11 sum)
- emit / emit_int (stack scratch) / emit_xcache: zero-alloc response framing; /baseline11
  and /upload no longer allocate an int->string per request
- /pipeline: skip-decode fast path (blit the const before parsing) + decode_into
- render_item_pg: byte-level JSON straight from db.pg text rows — removes the per-request
  json.encode reflection on /async-db, /crud list, /crud GET
- crud cache: id-indexed slab (replaces map[int]string) with in-place buffer reuse and
  cache-aside invalidation, shared across ring workers under RwMutex
- crud_list: single windowed query (count(*) OVER()) instead of page + separate count(*)
- parse_crud_body_fast + borrowed json field parsers (json.decode fallback kept)
- parse_i64_slice / dechunk_into / parse_hex_slice: allocation-free parsing

Static: unlike the epoll twin, this does NOT set static_assets.sendfile_min_bytes — the
io_uring backend has no sendfile path (no core.enable_sendfile / queue_file drain), so a
low threshold would make static_assets read every large .br/.gz sibling from disk per
request (blocking the ring). Keeping the default (256 KiB) preloads all arena siblings
(< 256 KiB) and serves them as a zero-copy core.queue_buf borrowed send.

DB profiles remain capped by the blocking db.pg on the single ring worker
(enghitalo/vanilla#83). Per-worker reused render scratch is a follow-up now that io_uring
supports make_state (enghitalo/vanilla#93 done; entry follow-up enghitalo/vanilla#97).

Verified: both images build; every route byte-identical to vanilla-epoll on a pristine
seeded Postgres; X-Cache MISS->HIT->re-MISS-after-PUT holds; /static/vendor.js (67 KB .br)
serves ~101k req/s / 6.38 GB/s under wrk (preloaded, no disk-read collapse).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@enghitalo

Contributor Author

/benchmark -f vanilla-io_uring

@github-actions

github-actions Bot commented Jul 4, 2026

Contributor

👋 /benchmark request received. A collaborator will review and approve the run.

…andler) — closes vanilla#97

Now that the io_uring backend supports make_state / stateful_handler (enghitalo/vanilla#94),
adopt the same per-worker-state shape as the epoll twin so the two entries are maximally
converged — only db.pg-blocking vs pg_async-async (enghitalo/vanilla#83) now separates
their handler code.

- Split Shared into SharedRO (process-shared: dataset, prefixes, asv, the shared thread-
  safe db.pg pool, and the mutex-guarded crud slab + gz cache) and a per-worker
  WorkerCtx { ro, scratch }.
- Dispatch through stateful_handler + make_state: each ring worker builds ONE WorkerCtx
  (its own reused render scratch), dropping the per-request []u8 the DB render paths
  (write_async_db / write_crud_list / write_crud_get MISS / write_fortunes) allocated —
  addresses the api-* memory growth seen in the CI run. High-RPS non-DB paths are
  unchanged (pipeline/baseline/json/static stay zero-alloc).
- Bump the pinned vanilla lib b189036 -> 6fb4244 (includes enghitalo/vanilla#94): the old pin REJECTS
  stateful_handler on io_uring at new_server() and panics on boot.

Verified: image builds; all 17 routes byte-identical to vanilla-epoll on a pristine
seeded Postgres (X-Cache MISS->HIT->re-MISS-after-PUT holds); wrk healthy on every path
(pipeline 241k, baseline 242k, json 194k, static/vendor.js 98k, crud 237k rps, zero
socket errors) — the stateful dispatch adds no measurable overhead and static stays fast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@enghitalo

Contributor Author

Pushed a follow-up commit (44ca0b6) that fully converges the io_uring entry onto the epoll twin, now that the io_uring backend supports per-worker state (enghitalo/vanilla#94, merged):

  • Split Shared → SharedRO (process-shared data + mutex-guarded crud slab / gz cache) + per-worker WorkerCtx { ro, scratch }, dispatched via stateful_handler + make_state, the same shape as vanilla-epoll.
  • DB render paths (write_async_db / write_crud_list / write_crud_get MISS / write_fortunes) now render into the worker's reused scratch instead of a per-request []u8 — removes the api-* memory growth from the earlier run. High-RPS non-DB paths unchanged.
  • Bumped the pinned vanilla lib b189036 → 6fb4244 (includes enghitalo/vanilla#94; the old pin rejects stateful_handler on io_uring at new_server() and panics on boot).
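
The SharedRO / WorkerCtx split described above can be pictured with the following sketch. It only shows the ownership shape: the names `SharedRO`, `WorkerCtx`, `make_state`, `stateful_handler`, `ro`, and `scratch` come from this PR, but the Python signatures, the `dataset` field type, and the rendered JSON are assumptions, and Python does not expose the allocation behaviour that motivates the change.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SharedRO:
    """Process-shared, read-only data (assumed here to be a tuple of names)."""
    dataset: tuple


@dataclass
class WorkerCtx:
    """Per-worker state: a handle to the shared data plus a private scratch."""
    ro: SharedRO
    scratch: bytearray = field(default_factory=bytearray)


def make_state(ro):
    """Called once per worker, so each worker owns exactly one scratch."""
    return WorkerCtx(ro=ro)


def stateful_handler(ctx, index):
    """Render one dataset entry into the worker's scratch, then copy it out."""
    buf = ctx.scratch
    buf.clear()  # reuse the same buffer object across requests
    buf += b'{"id":'
    buf += str(index).encode()
    buf += b',"name":"'
    buf += ctx.ro.dataset[index].encode()  # no JSON escaping in this sketch
    buf += b'"}'
    return bytes(buf)


ro = SharedRO(dataset=("alpha", "beta"))
worker_a = make_state(ro)
worker_b = make_state(ro)

scratch_before = worker_a.scratch
out0 = stateful_handler(worker_a, 0)
out1 = stateful_handler(worker_a, 1)
print(out0, out1)
```

Because every worker owns its scratch, rendering needs no lock; only the state reachable through `ro` is shared between workers.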

Closes enghitalo/vanilla#97. The only remaining handler-code divergence from the epoll twin is blocking db.pg vs async pg_async (enghitalo/vanilla#83).

Re-verified: all 17 routes byte-identical to vanilla-epoll on a pristine seeded Postgres; wrk healthy on every path (pipeline 241k, baseline 242k, json 194k, static/vendor.js 98k, crud 237k rps, zero socket errors).

@enghitalo

Contributor Author

/benchmark -f vanilla-io_uring

@github-actions

github-actions Bot commented Jul 4, 2026

Contributor

👋 /benchmark request received. A collaborator will review and approve the run.
